Titanic survival analysis

The Titanic survivors dataset is widely used to illustrate concepts of data cleaning and exploration.

Let's start by importing the data to a pandas DataFrame from a CSV file:


In [1]:
import pandas as pd

In [2]:
raw_data = pd.read_csv('datasets/titanic.csv')
raw_data.head()


Out[2]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

In [3]:
raw_data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

The information above shows that this dataset contains data for 891 passengers: their names, gender, age, etc. (for a complete description of the meaning of each column, check this link).

Missing values

Before starting the data analysis, we need to check the data's "health" by looking at how much information is actually present in each column.


In [4]:
# Percentage of missing values in each column
(raw_data.isnull().sum() / len(raw_data)) * 100.0


Out[4]:
PassengerId     0.000000
Survived        0.000000
Pclass          0.000000
Name            0.000000
Sex             0.000000
Age            19.865320
SibSp           0.000000
Parch           0.000000
Ticket          0.000000
Fare            0.000000
Cabin          77.104377
Embarked        0.224467
dtype: float64

It can be seen that 77% of the passengers have no information about which cabin they were allocated to. This information could be useful for further analysis but, for now, let's drop this column:


In [5]:
raw_data.drop('Cabin', axis='columns', inplace=True)
raw_data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(4)
memory usage: 76.6+ KB

The Embarked column, which indicates the port where the passenger embarked, only has a few missing entries. Since the number of passengers with missing values is negligible, they can be discarded without much harm:


In [6]:
raw_data.dropna(subset=['Embarked'], inplace=True)
(raw_data.isnull().sum() / len(raw_data)) * 100.0


Out[6]:
PassengerId     0.000000
Survived        0.000000
Pclass          0.000000
Name            0.000000
Sex             0.000000
Age            19.910011
SibSp           0.000000
Parch           0.000000
Ticket          0.000000
Fare            0.000000
Embarked        0.000000
dtype: float64

Finally, age is missing for around 20% of the passengers. It's not reasonable to drop all these passengers, nor to drop the column as a whole, so one possible solution is to fill the missing values with the median age of the dataset:


In [7]:
raw_data.fillna({'Age': raw_data.Age.median()}, inplace=True)
(raw_data.isnull().sum() / len(raw_data)) * 100.0


Out[7]:
PassengerId    0.0
Survived       0.0
Pclass         0.0
Name           0.0
Sex            0.0
Age            0.0
SibSp          0.0
Parch          0.0
Ticket         0.0
Fare           0.0
Embarked       0.0
dtype: float64

Why use the median instead of the average?

The median is a robust statistic. A statistic is a number that summarizes a set of values, and a statistic is said to be robust if it is not significantly affected by extreme values in the data.

Suppose we have a group of people whose ages are [15, 16, 14, 15, 15, 19, 14, 17]. The average age in this group is 15.625. If an 80-year-old person joins the group, its average age becomes about 22.78 years, which no longer represents the group's age profile well. The median age, on the other hand, is 15 years in both cases - i.e. the median was not changed by the presence of an outlier in the data, which makes it a robust statistic for the ages of the group.
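The example above can be reproduced with Python's standard library:

```python
import statistics

# Ages from the example above
ages = [15, 16, 14, 15, 15, 19, 14, 17]
print(statistics.mean(ages))    # 15.625
print(statistics.median(ages))  # 15.0

ages.append(80)                 # an outlier joins the group
print(statistics.mean(ages))    # ~22.78 -- pulled up by the outlier
print(statistics.median(ages))  # 15 -- unchanged
```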

Now that all of the passengers' information has been "cleaned", we can start to analyse the data.

Exploratory analysis

Let's start by exploring how many people in this dataset survived the Titanic:


In [8]:
import matplotlib.pyplot as plt
%matplotlib inline

In [9]:
overall_fig = raw_data.Survived.value_counts().plot(kind='bar')
overall_fig.set_xlabel('Survived')
overall_fig.set_ylabel('Amount')


Out[9]:
<matplotlib.text.Text at 0x119d124e0>

Overall, 38% of the passengers survived.
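That percentage can be computed directly with `value_counts(normalize=True)`. A minimal sketch on a stand-in Series (on the real data, `raw_data.Survived.value_counts(normalize=True)` yields roughly 0.62 / 0.38):

```python
import pandas as pd

# Stand-in for raw_data.Survived, just to illustrate the call
survived = pd.Series([0, 0, 0, 1, 1])

# normalize=True returns proportions instead of raw counts
print(survived.value_counts(normalize=True))
```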

Now, let's segment the proportion of survivors along different profiles (the code to generate the following graphs was taken from this link).

By gender


In [10]:
survived_sex = raw_data[raw_data['Survived']==1]['Sex'].value_counts()
dead_sex = raw_data[raw_data['Survived']==0]['Sex'].value_counts()
df = pd.DataFrame([survived_sex,dead_sex])
df.index = ['Survivors','Non-survivors']
df.plot(kind='bar',stacked=True, figsize=(15,8));


By age


In [11]:
figure = plt.figure(figsize=(15,8))
plt.hist([raw_data[raw_data['Survived']==1]['Age'], raw_data[raw_data['Survived']==0]['Age']], 
         stacked=True, color=['g','r'],
         bins=30, label=['Survivors','Non-survivors'])
plt.xlabel('Age')
plt.ylabel('No. passengers')
plt.legend();


By fare


In [12]:
figure = plt.figure(figsize=(15,8))
plt.hist([raw_data[raw_data['Survived']==1]['Fare'], raw_data[raw_data['Survived']==0]['Fare']], 
         stacked=True, color=['g','r'],
         bins=50, label=['Survivors','Non-survivors'])
plt.xlabel('Fare')
plt.ylabel('No. passengers')
plt.legend();


The graphs above indicate that passengers who are female, younger than 20, and/or paid higher fares had a greater chance of surviving the Titanic (what a surprise!). How can we use this information to predict whether a passenger would survive the accident?

Predicting chances of surviving

Let's start by keeping only the information that we wish to use - we'll keep the passenger names for further analysis:


In [13]:
data_for_prediction = raw_data[['Name', 'Sex', 'Age', 'Fare', 'Survived']].copy()
data_for_prediction.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 889 entries, 0 to 890
Data columns (total 5 columns):
Name        889 non-null object
Sex         889 non-null object
Age         889 non-null float64
Fare        889 non-null float64
Survived    889 non-null int64
dtypes: float64(2), int64(1), object(2)
memory usage: 41.7+ KB

Numeric encoding of Strings

Some information is encoded as strings: the passenger's gender, for instance, is represented by the strings male and female. To make use of this information in our upcoming predictive model, we must convert them to numeric values:


In [14]:
data_for_prediction['Sex'] = data_for_prediction.Sex.map({'male': 0, 'female': 1})
data_for_prediction.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 889 entries, 0 to 890
Data columns (total 5 columns):
Name        889 non-null object
Sex         889 non-null int64
Age         889 non-null float64
Fare        889 non-null float64
Survived    889 non-null int64
dtypes: float64(2), int64(2), object(1)
memory usage: 41.7+ KB

Training/validation set split

In order to assess the model's predictive power, part of the data (in this case, 25%) must be set aside as a validation set.

A validation set is a dataset for which the expected values are known but which is not used to train the predictive model - this way, the model is not biased by information from these entries, and the set can be used to estimate the model's error rate.


In [15]:
from sklearn.model_selection import train_test_split

train_data, test_data = train_test_split(data_for_prediction, test_size=0.25, random_state=254)
len(train_data), len(test_data)


Out[15]:
(666, 223)

Predicting survival chances with decision trees

We'll use a simple Decision Tree model to predict whether a passenger would survive the Titanic based on their gender, age, and fare.


In [16]:
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier().fit(train_data[['Sex', 'Age', 'Fare']], train_data.Survived)
tree.score(test_data[['Sex', 'Age', 'Fare']], test_data.Survived)


Out[16]:
0.80269058295964124

With a simple decision tree, the result above indicates that it's possible to correctly predict the survival of about 80% of the passengers.
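A single train/test split gives a somewhat noisy accuracy estimate; a common refinement is k-fold cross-validation, which averages the score over several splits. Below is a minimal sketch using scikit-learn's `cross_val_score` - the DataFrame here is synthetic, made up so the snippet runs on its own; on the real data you would pass `data_for_prediction[['Sex', 'Age', 'Fare']]` and `data_for_prediction.Survived` instead:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Titanic features (made up for illustration)
rng = np.random.default_rng(0)
n = 600
demo = pd.DataFrame({
    'Sex': rng.integers(0, 2, n),
    'Age': rng.uniform(1, 70, n),
    'Fare': rng.uniform(5, 100, n),
})
# Label loosely mimicking the pattern seen above: women and
# high-fare passengers survive more often
demo['Survived'] = ((demo.Sex == 1) | (demo.Fare > 60)).astype(int)

# Accuracy averaged over 5 folds instead of a single split
scores = cross_val_score(DecisionTreeClassifier(random_state=0),
                         demo[['Sex', 'Age', 'Fare']], demo.Survived, cv=5)
print(scores.mean())
```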

An interesting exercise after training a predictive model is to take a look at the cases where it missed:


In [17]:
test_data = test_data.copy()
test_data['Predicted'] = tree.predict(test_data[['Sex', 'Age', 'Fare']])
test_data[test_data.Predicted != test_data.Survived]


Out[17]:
Name Sex Age Fare Survived Predicted
207 Albimona, Mr. Nassef Cassem 0 26.00 18.7875 1 0
660 Frauenthal, Dr. Henry William 0 50.00 133.6500 1 0
81 Sheerlinck, Mr. Jan Baptist 0 29.00 9.5000 1 0
762 Barah, Mr. Hanna Assi 0 20.00 7.2292 1 0
446 Mellinger, Miss. Madeleine Violet 1 13.00 19.5000 1 0
247 Hamalainen, Mrs. William (Anna) 1 24.00 14.5000 1 0
43 Laroche, Miss. Simonne Marie Anne Andree 1 3.00 41.5792 1 0
137 Futrelle, Mr. Jacques Heath 0 37.00 53.1000 0 1
679 Cardeza, Mr. Thomas Drake Martinez 0 36.00 512.3292 1 0
821 Lulic, Mr. Nikola 0 27.00 8.6625 1 0
508 Olsen, Mr. Henry Margido 0 28.00 22.5250 0 1
357 Funk, Miss. Annie Clemmer 1 38.00 13.0000 0 1
748 Marvin, Mr. Daniel Warner 0 19.00 53.1000 0 1
288 Hosono, Mr. Masabumi 0 42.00 13.0000 1 0
712 Taylor, Mr. Elmer Zebley 0 48.00 52.0000 1 0
238 Pengelly, Mr. Frederick William 0 19.00 10.5000 0 1
804 Hedman, Mr. Oskar Arvid 0 27.00 6.9750 1 0
71 Goodwin, Miss. Lillian Amy 1 16.00 46.9000 0 1
429 Pickard, Mr. Berk (Berk Trembisky) 0 32.00 8.0500 1 0
498 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) 1 25.00 151.5500 0 1
692 Lam, Mr. Ali 0 28.00 56.4958 1 0
147 Ford, Miss. Robina Maggie "Ruby" 1 9.00 34.3750 0 1
245 Minahan, Dr. William Edward 0 44.00 90.0000 0 1
18 Vander Planke, Mrs. Julius (Emelia Maria Vande... 1 31.00 18.0000 0 1
259 Parrish, Mrs. (Lutie Davis) 1 50.00 26.0000 1 0
271 Tornquist, Mr. William Henry 0 25.00 0.0000 1 0
339 Blackwell, Mr. Stephen Weart 0 45.00 35.5000 0 1
314 Hart, Mr. Benjamin 0 43.00 26.2500 0 1
209 Blank, Mr. Henry 0 40.00 31.0000 1 0
440 Hart, Mrs. Benjamin (Esther Ada Bloomfield) 1 45.00 26.2500 1 0
673 Wilhelms, Mr. Charles 0 31.00 13.0000 1 0
678 Goodwin, Mrs. Frederick (Augusta Tyler) 1 43.00 46.9000 0 1
100 Petranec, Miss. Matilda 1 28.00 7.8958 0 1
400 Niskanen, Mr. Juha 0 39.00 7.9250 1 0
17 Williams, Mr. Charles Eugene 0 28.00 13.0000 1 0
449 Peuchen, Major. Arthur Godfrey 0 52.00 30.5000 1 0
312 Lahtinen, Mrs. William (Anna Sylfven) 1 26.00 26.0000 0 1
637 Collyer, Mr. Harvey 0 31.00 26.2500 0 1
796 Leader, Dr. Alice (Farnham) 1 49.00 25.9292 1 0
737 Lesurer, Mr. Gustave J 0 35.00 512.3292 1 0
146 Andersson, Mr. August Edvard ("Wennerstrom") 0 27.00 7.7958 1 0
862 Swift, Mrs. Frederick Joel (Margaret Welles Ba... 1 48.00 25.9292 1 0
78 Caldwell, Master. Alden Gates 0 0.83 29.0000 1 0
473 Jerwan, Mrs. Amin S (Marie Marthe Thuillard) 1 23.00 13.7917 1 0

One example of a wrong prediction above is the case of Mrs. Hudson J C Allison, who didn't survive the Titanic despite being female, 25 years old, and having paid an expensive fare. A search on Encyclopedia Titanica reveals that, after having been put into a lifeboat, she was informed that her son had embarked on another lifeboat on the opposite side of the ship - Mrs. Allison then left her boat in an attempt to reach her son, but to no avail.

A particularly interesting collection of stories related to the Titanic passengers can be found in this post.